Distributed Principal Subspace Analysis for Partitioned Big Data: Algorithms, Analysis, and Implementation

Authors

Abstract

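Principal Subspace Analysis (PSA), along with its sibling Principal Component Analysis (PCA), is one of the most popular approaches for dimensionality reduction in signal processing and machine learning. But centralized PSA/PCA solutions are fast becoming irrelevant in the modern era of big data, in which the number of samples and/or the dimensionality of samples often exceed the storage and/or computational capabilities of individual machines. This has led to the study of distributed PSA/PCA solutions, in which the data are partitioned across multiple machines and an estimate of the principal subspace is obtained through collaboration among the machines. It is in this vein that this paper revisits the problem of distributed PSA/PCA under the general framework of an arbitrarily connected network of machines that lacks a central server. The main contributions of the paper in this regard are threefold. First, two algorithms are proposed that can be used for distributed PSA/PCA, with one in the case of data partitioned across samples and the other in the case of data partitioned across (raw) features. Second, in the case of sample-wise partitioned data, the proposed algorithm and a variant of it are analyzed, and their convergence to the true subspace at linear rates is established. Third, extensive experiments on both synthetic and real-world data are carried out to validate the usefulness of the proposed algorithms. In particular, in the case of sample-wise partitioned data, an MPI-based distributed implementation is carried out to study the interplay between network topology and communications cost as well as the effects of straggler machines on the proposed algorithms.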
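To make the sample-wise partitioned, server-free setting concrete, the following is a minimal single-machine simulation of a generic approach: orthogonal iteration in which each node multiplies its local covariance by its current estimate and then averages only with its neighbors. It is an illustrative sketch, not necessarily the algorithms proposed in the paper. The ring topology, the mixing weights, the iteration counts, and all function names (`ring_weights`, `decentralized_orthogonal_iteration`, `subspace_error`) are assumptions made for this example, and plain NumPy stands in for MPI communication.

```python
import numpy as np

def ring_weights(n_nodes):
    """Doubly stochastic mixing matrix for a ring (assumed topology):
    every node averages equally with itself and its two neighbors."""
    W = np.zeros((n_nodes, n_nodes))
    for i in range(n_nodes):
        W[i, i] += 1 / 3
        W[i, (i - 1) % n_nodes] += 1 / 3
        W[i, (i + 1) % n_nodes] += 1 / 3
    return W

def decentralized_orthogonal_iteration(local_covs, W, rank,
                                       n_outer=50, n_consensus=50, seed=0):
    """Simulate orthogonal iteration with neighbor-only averaging.
    Returns one (dim x rank) orthonormal estimate per node."""
    n_nodes = len(local_covs)
    dim = local_covs[0].shape[0]
    rng = np.random.default_rng(seed)
    Q0, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
    Q = np.stack([Q0] * n_nodes)  # same random start at every node
    for _ in range(n_outer):
        # Local step: each node uses only its own covariance.
        V = np.stack([C @ Qi for C, Qi in zip(local_covs, Q)])
        # Consensus step: repeated neighbor averaging approximates
        # the network-wide mean without a central server.
        for _ in range(n_consensus):
            V = np.einsum("ij,jdr->idr", W, V)
        # Local orthonormalization via QR.
        Q = np.stack([np.linalg.qr(Vi)[0] for Vi in V])
    return Q

def subspace_error(Q, U):
    """Frobenius distance between the projectors onto span(Q) and span(U)."""
    return np.linalg.norm(Q @ Q.T - U @ U.T)

# Synthetic data with a clear gap after the top two eigenvalues.
rng = np.random.default_rng(1)
dim, rank, n_nodes, n_local = 10, 2, 6, 200
scales = np.sqrt(np.array([10.0, 8.0] + [1.0] * (dim - 2)))
basis, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
local_covs = []
for _ in range(n_nodes):
    X = (basis * scales) @ rng.standard_normal((dim, n_local))  # columns = samples
    local_covs.append(X @ X.T / n_local)

# Centralized reference: top eigenvectors of the pooled covariance.
global_cov = sum(local_covs) / n_nodes
_, eigvecs = np.linalg.eigh(global_cov)  # eigenvalues in ascending order
U_top = eigvecs[:, -rank:]

Q_nodes = decentralized_orthogonal_iteration(local_covs, ring_weights(n_nodes), rank)
errors = [subspace_error(Qi, U_top) for Qi in Q_nodes]
print("largest per-node subspace error:", max(errors))
```

The number of consensus rounds per outer iteration is the knob that trades communication for accuracy in this sketch: a sparsely connected topology mixes more slowly and needs more rounds to approximate the network-wide average to the same precision.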

Similar resources

A covariance-free iterative algorithm for distributed principal component analysis on vertically partitioned data

In this paper, a covariance-free iterative algorithm is developed to achieve distributed principal component analysis on high-dimensional data sets that are vertically partitioned. We have proved that our iterative algorithm converges monotonically at an exponential rate. Different from existing techniques that aim at approximating the global PCA, our covariance-free iterative distributed PCA ...

Principal and Minor Subspace Tracking: Algorithms & Stability Analysis

We consider the problem of tracking the minor or principal subspace of a positive Hermitian covariance matrix. We first propose a fast and numerically robust implementation of the Oja algorithm (FOOja: Fast Orthogonal Oja). The latter is said to be fast in the sense that its computational cost is of order O(np) flops per iteration, where n is the size of the observation vector and p < n is the number of m...

MapReduce Algorithms for Big Data Analysis

There is a growing trend of applications that should handle big data. However, analyzing big data is a very challenging problem today. For such applications, the MapReduce framework has recently attracted a lot of attention. Google’s MapReduce or its open-source equivalent Hadoop is a powerful tool for building such applications. In this tutorial, we will introduce the MapReduce framework based...

Communication-efficient Algorithms for Distributed Stochastic Principal Component Analysis

We study the fundamental problem of Principal Component Analysis in a statistical distributed setting in which each machine out of m stores a sample of n points sampled i.i.d. from a single unknown distribution. We study algorithms for estimating the leading principal component of the population covariance matrix that are both communication-efficient and achieve estimation error of the order of...

Principal Component Analysis and Higher Correlations for Distributed Data

We consider algorithmic problems in the setting in which the input data has been partitioned arbitrarily on many servers. The goal is to compute a function of all the data, and the bottleneck is the communication used by the algorithm. We present algorithms for two illustrative problems on massive data sets: (1) computing a low-rank approximation of a matrix $A = A^1 + A^2 + \cdots + A^s$, with matrix $A^t$ stored...

Journal

Journal title: IEEE Transactions on Signal and Information Processing over Networks

Year: 2021

ISSN: 2373-776X, 2373-7778

DOI: https://doi.org/10.1109/tsipn.2021.3122297